Gritray: From Pixels to Relationships

Physical Intelligence Research • Garlileo Lab

Why Future World Models May Never Need to See Reality

June 22, 2026

Abstract

Modern world models are built upon a simple assumption: if an artificial intelligence system can see enough of the world, it will eventually understand the world. This assumption has driven decades of research in computer vision, robotics, and machine learning. Cameras have become higher in resolution, models have become larger, and datasets have become increasingly massive. Yet a fundamental question remains unresolved: does understanding reality actually require seeing reality as humans do?

This paper proposes an alternative perspective. The future of Physical AI may not depend on reproducing the visual richness of the world, but on extracting the hidden relationships that define it. We introduce Gritray, a visual translation layer that transforms reflected light into structural representations, allowing world models to reason about geometry, interactions, and future outcomes without relying on complete visual reconstruction. Within this framework, Gritray serves as the foundation upon which Fricial, Artifriction, and Resonial operate, creating a layered architecture for future world models.

From Pixels to Relationships

The dominant paradigm in computer vision can be summarized as a simple pipeline: camera, image, recognition, action. A camera captures an image, a neural network identifies objects within that image, and a control system determines the next action. While this approach has produced remarkable achievements, it has a built-in limitation. Pixels do not explicitly encode relationships. An image may contain a table, a cup, a wall, a person, and a door, but identifying those objects does not explain how they interact. It does not explain whether the table supports the cup, whether the person intends to approach the door, or whether a collision is possible.

The physical world is not fundamentally a collection of objects. It is a collection of relationships. Objects are simply participants within those relationships. A world model that understands the relationships among entities may possess a deeper understanding of reality than a model that merely recognizes visual categories. Therefore, the central challenge of Physical AI may not be object recognition, but relational reconstruction.

The Line World Hypothesis

Consider how architects design buildings, how engineers describe machines, or how physicists model motion. They rarely begin with photographs. Instead, they begin with geometry. Lines, edges, surfaces, boundaries, and spatial relationships form the foundation of their representations. A photograph contains enormous amounts of information that may be irrelevant to physical reasoning, including color variations, lighting conditions, reflections, shadows, and visual noise. A geometric representation removes much of this complexity while preserving structure.

Imagine a warehouse observed by a camera. The raw image contains changing weather conditions, shifting sunlight, worker shadows, reflective surfaces, and thousands of visual details. A wireframe representation removes these distractions. Walls become boundaries, shelves become volumes, pathways become edges in a traversable graph, and moving objects become dynamic nodes. The resulting representation may appear less realistic than a photograph, yet it may be far more useful for prediction and decision-making. The robot no longer sees images. The robot sees relationships.
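The warehouse example can be made concrete with a minimal sketch. It is an illustration only, not part of any Gritray implementation: the junction names, the AISLES layout, and the shortest_route helper are all hypothetical. Pathway junctions are nodes, aisles are edges, and a moving object is modeled simply as a temporarily blocked node.

```python
from collections import deque

# Hypothetical warehouse layout: pathway junctions are nodes,
# traversable aisles are undirected edges.
AISLES = {
    "dock": ["aisle_a", "aisle_b"],
    "aisle_a": ["dock", "packing"],
    "aisle_b": ["dock", "cross"],
    "cross": ["aisle_b", "packing"],
    "packing": ["aisle_a", "cross"],
}


def shortest_route(graph, start, goal, blocked=frozenset()):
    """Breadth-first search over junctions, skipping blocked nodes.

    Returns the list of junctions on a shortest route, or None when
    no route exists.
    """
    if start in blocked or goal in blocked:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph[node]:
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Route through the empty warehouse.
free_route = shortest_route(AISLES, "dock", "packing")
# A moving object (for example a forklift) currently occupies aisle_a.
detour = shortest_route(AISLES, "dock", "packing", blocked={"aisle_a"})
```

In this toy layout the free route runs dock, aisle_a, packing, and the detour runs dock, aisle_b, cross, packing. Nothing in the computation depends on lighting, shadows, or color, which is the point of the wireframe view.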

This idea can be described as the Line World Hypothesis: future world models may not require realistic visual representations of reality. Instead, they may require highly accurate relational representations of reality. The goal is not to reconstruct every pixel, but to reconstruct the hidden structure beneath those pixels.

Gritray as a Visual Translation Layer

Gritray proposes that vision should not begin with recognition but with translation. The purpose of Gritray is not to identify what an object is. Its purpose is to infer how objects relate to one another within physical space.

The process can be described as:

Reality
    ↓
Reflected Light
    ↓
Geometric Reconstruction
    ↓
Relationship Graph
    ↓
World Model

Cameras observe reflected light, but reflected light contains far more information than simple appearance. Surface roughness, material composition, geometry, and environmental conditions all influence the behavior of light. By reconstructing edges, depth, surfaces, and spatial structure from these signals, Gritray converts visual observations into a relational graph.

Within this graph, objects are no longer collections of pixels. They become nodes connected by distance, support relationships, collision possibilities, movement constraints, and interaction opportunities. In this sense, Gritray is not a camera system. It is a translation layer between visible light and invisible structure.
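As a rough sketch of what such a graph might look like in code, the example below derives distance and support edges from axis-aligned boxes. It assumes the geometric reconstruction step has already produced the boxes; the Box class, the supports test, the tolerance, and the scene values are hypothetical, and collision and movement constraints are left out for brevity.

```python
from dataclasses import dataclass
from itertools import combinations
import math


@dataclass(frozen=True)
class Box:
    """Axis-aligned box: minimum corner (x, y, z) plus extents, in metres."""
    name: str
    x: float
    y: float
    z: float
    w: float  # extent along x
    d: float  # extent along y
    h: float  # extent along z

    def center(self):
        return (self.x + self.w / 2, self.y + self.d / 2, self.z + self.h / 2)


def overlap_1d(a_min, a_len, b_min, b_len):
    """True when two intervals share some interior length."""
    return a_min < b_min + b_len and b_min < a_min + a_len


def supports(lower, upper, tol=0.01):
    """lower supports upper if upper's base rests on lower's top face."""
    touching = abs((lower.z + lower.h) - upper.z) <= tol
    footprint = (overlap_1d(lower.x, lower.w, upper.x, upper.w)
                 and overlap_1d(lower.y, lower.d, upper.y, upper.d))
    return touching and footprint


def build_relations(boxes):
    """Return (subject, relation, object, value) edges for every pair."""
    edges = []
    for a, b in combinations(boxes, 2):
        dist = round(math.dist(a.center(), b.center()), 3)
        edges.append((a.name, "distance", b.name, dist))
        for lo, up in ((a, b), (b, a)):
            if supports(lo, up):
                edges.append((lo.name, "supports", up.name, True))
    return edges


# Hypothetical scene: a cup on a table on a floor.
scene = [
    Box("floor", 0, 0, -0.1, 5, 5, 0.1),
    Box("table", 1, 1, 0.0, 1.0, 1.0, 0.75),
    Box("cup", 1.4, 1.4, 0.75, 0.1, 0.1, 0.1),
]
relations = build_relations(scene)
```

For this scene the graph records that the floor supports the table and the table supports the cup, but not that the floor supports the cup directly. That distinction is exactly the kind of relational fact that object labels alone do not provide.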

Light as a Gateway to Physics

The name Gritray originates from two concepts: Grit, representing the hidden physical interactions embedded within reality, and Ray, representing the light that reveals them.

Every object communicates information through reflected light. A polished metal surface reflects differently from rough concrete. Dry ground reflects differently from wet ground. Rubber grips differently from ice, and the two also look different. Cameras do not directly observe friction, resistance, or material properties. They observe optical signatures that correlate with these phenomena.

The objective of Gritray is to bridge this gap. Instead of treating vision as a classification problem, it treats vision as a physical inference problem. Light becomes a gateway through which hidden properties of the world can be estimated. Future world models may depend less on recognizing objects and more on inferring the physical constraints that govern those objects.

The Foundation of Fricial

Once relationships have been reconstructed, physical interactions become predictable. This leads directly to the concept of Fricial, the generalized interaction layer of reality.

Traditional friction is only one example of such an interaction. Air resistance, fluid resistance, contact forces, material constraints, energy dissipation, and countless other interactions belong to the same family. Fricial describes how entities influence one another within physical space.
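One way to picture this family of interactions is as a sum of resistive terms acting on the same body. The sketch below combines two textbook models, Coulomb kinetic friction and quadratic aerodynamic drag, and integrates them to estimate a stopping distance. It is an illustration under stated assumptions rather than a definition of Fricial: the function names, the default air density, drag coefficient, and frontal area are all hypothetical choices.

```python
G = 9.81  # gravitational acceleration, m/s^2


def coulomb_friction(mass, speed, mu):
    """Kinetic friction magnitude on a level surface; zero when at rest."""
    return mu * mass * G if speed > 0 else 0.0


def quadratic_drag(speed, rho, drag_coeff, area):
    """Aerodynamic drag magnitude: 0.5 * rho * Cd * A * v^2."""
    return 0.5 * rho * drag_coeff * area * speed ** 2


def resistive_force(mass, speed, mu, rho=1.225, drag_coeff=1.0, area=0.5):
    """Total resistive force as the sum of the individual interaction terms."""
    return (coulomb_friction(mass, speed, mu)
            + quadratic_drag(speed, rho, drag_coeff, area))


def stopping_distance(mass, speed, mu, dt=0.001):
    """Integrate the deceleration with explicit Euler steps until rest.

    Requires mu > 0: with drag alone the speed only decays towards zero
    and the loop would never terminate.
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    distance = 0.0
    while speed > 0:
        decel = resistive_force(mass, speed, mu) / mass
        distance += speed * dt
        speed = max(0.0, speed - decel * dt)
    return distance


# A 20 kg crate sliding at 3 m/s on a grippy versus a slippery floor.
grippy = stopping_distance(20.0, 3.0, mu=0.7)
slippery = stopping_distance(20.0, 3.0, mu=0.1)
```

Further interaction terms, such as fluid resistance, could be added to resistive_force in the same additive way. The slippery floor yields a much longer stopping distance than the grippy one.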

Gritray provides the structure necessary to estimate these interactions. Without structure, interactions remain hidden. Without interactions, structure lacks meaning. Together, Gritray and Fricial transform visual observations into physically meaningful representations capable of supporting prediction and simulation.

From Fricial to Artifriction

Human beings rarely calculate friction coefficients explicitly. Instead, we develop intuition through experience. Artificial systems require a similar capability.

This is where Artifriction emerges. Artifriction represents the internal understanding of physical interactions developed by an AI system. By observing reflected light, geometric structure, movement trajectories, and environmental conditions, a model can estimate quantities that are not physical properties of the world but exist only within its own reasoning process.

These may include slipping probabilities, traversability scores, collision risks, stability estimates, or confidence measures. Artifriction therefore acts as the cognitive layer built upon Fricial. It transforms physical observations into predictive understanding, allowing machines to reason about the consequences of actions before those actions occur.
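A slipping probability of this kind can be sketched with elementary physics. A body resting on an incline slips when the tangent of the slope angle exceeds the static friction coefficient. If the model holds an uncertain, Gaussian belief about that coefficient, the slip probability is the probability mass below the tangent. The Gaussian belief, the function names, and the example numbers are assumptions made for illustration.

```python
import math


def slip_probability(slope_deg, mu_mean, mu_std):
    """P(mu < tan(slope)) under a Gaussian belief over the friction coefficient."""
    if mu_std <= 0:
        raise ValueError("mu_std must be positive")
    z = (math.tan(math.radians(slope_deg)) - mu_mean) / mu_std
    # Standard normal cumulative distribution function via erf.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def traversability(slope_deg, mu_mean, mu_std):
    """Simple traversability score: the probability of not slipping."""
    return 1.0 - slip_probability(slope_deg, mu_mean, mu_std)


# The same 10 degree ramp under two hypothetical surface beliefs.
dry = slip_probability(10.0, mu_mean=0.7, mu_std=0.1)
icy = slip_probability(10.0, mu_mean=0.1, mu_std=0.05)
```

With these numbers the dry ramp has a negligible slip probability, while the icy ramp comes out at about 0.94. Neither number exists in the scene itself. Both are internal estimates derived from a belief about the surface.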

The Role of Resonial

Even a complete understanding of geometry and interaction remains insufficient for a true world model. The world is not merely physical. It is temporal.

Humans sleep at night. Cities exhibit rush hours. Factories operate on schedules. Biological systems follow cycles. These patterns cannot be explained solely through geometry or force interactions. They emerge from coordination across time.

This coordination layer is described by Resonial. Resonial represents the latent phase relationships that synchronize behavior across agents, systems, and environments. If Gritray reveals structure and Fricial explains interaction, Resonial explains timing. It provides the hidden rhythm that prevents complex systems from descending into chaos.

World models capable of predicting future states must therefore understand not only what interacts, but also when interactions are likely to occur.
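A simple way to quantify "when" for a recurring pattern is circular statistics: map each event time onto a unit circle whose circumference is one period, then average. The direction of the mean vector gives the typical phase, and its length (between 0 and 1) indicates how tightly events cluster. This is a standard technique offered only as an illustrative sketch. It is not a definition of Resonial, and the phase_summary helper and the sample data are hypothetical.

```python
import cmath
import math


def phase_summary(event_hours, period=24.0):
    """Return (mean phase in hours, concentration R in [0, 1])."""
    if not event_hours:
        raise ValueError("event_hours must not be empty")
    # Place each event on the unit circle at its phase angle.
    vectors = [cmath.exp(2j * math.pi * (h % period) / period)
               for h in event_hours]
    mean_vec = sum(vectors) / len(vectors)
    angle = cmath.phase(mean_vec) % (2 * math.pi)
    return angle * period / (2 * math.pi), abs(mean_vec)


# Hypothetical observations: arrival times clustered around a morning
# rush hour versus events spread evenly across the day.
rush_hour = [7.5, 8.0, 8.5, 8.0, 7.75, 8.25]
scattered = [0.0, 6.0, 12.0, 18.0]

rush_phase, rush_strength = phase_summary(rush_hour)
_, scattered_strength = phase_summary(scattered)
```

The rush-hour sample yields a mean phase of 8.0 hours with a concentration close to 1, while the evenly scattered sample yields a concentration near 0, meaning its mean phase carries no information. Because the average is taken on a circle, events at 23.5 and 0.5 average to midnight rather than to noon.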

Conclusion

The future of world models may not depend on larger neural networks, higher-resolution cameras, or increasingly realistic simulations. The more important challenge may be learning how to see relationships instead of pixels.

Gritray offers a possible path toward this objective by transforming reflected light into geometry, geometry into relationships, and relationships into physically meaningful representations. Within this framework, Gritray reveals structure, Fricial explains interaction, Artifriction learns interaction, and Resonial coordinates interaction across time.

Together, these concepts form a layered architecture:

Reality
    ↓
Light
    ↓
Gritray (Structure Extraction)
    ↓
Fricial (Physical Interaction)
    ↓
Artifriction (AI Understanding)
    ↓
Resonial (Temporal Coordination)
    ↓
World Model

Future intelligence may not begin with bigger models.

It may begin with learning to see the invisible structure hidden beneath the visible world.